Through the Looking Glass
November 12, 2024
by Gérard Biau and Erwan Scornet [1]
Each tree estimates the response at point \(x\) as: \[ m_n(x; \Theta_j, D_n) = \frac{\sum_{i \in D_n(\Theta_j)} \mathbf{1}_{X_i \in A_n(x; \Theta_j, D_n)} Y_i}{N_n(x; \Theta_j, D_n)} \] where \(D_n(\Theta_j)\) is the resampled data subset, \(A_n(x; \Theta_j, D_n)\) is the cell containing \(x\), and \(N_n(x; \Theta_j, D_n)\) is the count of points in the cell.
The forest estimate for \(M\) trees is: \[ m_{M, n}(x) = \frac{1}{M} \sum_{j=1}^{M} m_n(x; \Theta_j, D_n) \] where \(M\) is the number of trees and \(m_n(x; \Theta_j, D_n)\) is the prediction of the \(j\)-th tree, built with its own randomization \(\Theta_j\) (the resampling and the split choices).
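The two formulas above can be sketched in a few lines of Python. This is a minimal illustration rather than a real tree-growing routine. It assumes a single feature, represents each tree by a hypothetical list of cut points plus the indices of its resampled points, and returns 0 for an empty cell (an assumed convention).

```python
from bisect import bisect_right

def tree_estimate(x, cuts, sample_idx, X, Y):
    """Average of Y_i over the resampled points that share a cell with x."""
    cell = bisect_right(cuts, x)  # index of the interval (cell) containing x
    in_cell = [Y[i] for i in sample_idx if bisect_right(cuts, X[i]) == cell]
    if not in_cell:  # assumed convention: an empty cell predicts 0
        return 0.0
    return sum(in_cell) / len(in_cell)

def forest_estimate(x, trees, X, Y):
    """Plain average of the individual tree estimates."""
    return sum(tree_estimate(x, cuts, idx, X, Y) for cuts, idx in trees) / len(trees)

# Hypothetical toy data: one feature, four observations.
X = [0.1, 0.4, 0.6, 0.9]
Y = [1.0, 3.0, 5.0, 7.0]
trees = [
    ([0.5], [0, 1, 2, 3]),  # tree 1: one cut at 0.5, uses all points
    ([0.3], [1, 2, 3]),     # tree 2: one cut at 0.3, uses a subsample
]
print(forest_estimate(0.45, trees, X, Y))
```

Here the first tree averages the two points left of 0.5 and the second averages the three subsampled points right of 0.3; the forest prediction is the mean of those two values.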
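Splitting Criteria (regression):
\[ L_{\text{reg},n}(j, z) = \frac{1}{N_n(A)} \sum_{i=1}^{n} (Y_i - \bar{Y}_A)^2 \mathbf{1}_{X_i \in A} - \frac{1}{N_n(A)} \left(\sum_{i=1}^{n} (Y_i - \bar{Y}_{A_L})^2 \mathbf{1}_{X_i \in A_L} + \sum_{i=1}^{n} (Y_i - \bar{Y}_{A_R})^2 \mathbf{1}_{X_i \in A_R}\right) \]
where the cut \((j, z)\) splits the cell \(A\) at position \(z\) along coordinate \(j\) into \(A_L = \{x \in A : x^{(j)} < z\}\) and \(A_R = \{x \in A : x^{(j)} \geq z\}\), and \(\bar{Y}_A\), \(\bar{Y}_{A_L}\), \(\bar{Y}_{A_R}\) are the averages of the \(Y_i\) whose \(X_i\) fall in \(A\), \(A_L\), and \(A_R\). The best cut is the one that maximizes \(L_{\text{reg},n}(j, z)\).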
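As a rough sketch of how this criterion scores a candidate cut, the snippet below computes the within-cell variance before the split minus the pooled within-child variance after it. The function names and the toy data are illustrative assumptions; only one coordinate is considered, and points with a value below `z` go to the left child.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def l_reg(xs_j, ys, z):
    """Variance reduction obtained by cutting the cell at xs_j < z."""
    left = [y for x, y in zip(xs_j, ys) if x < z]
    right = [y for x, y in zip(xs_j, ys) if x >= z]
    n = len(ys)  # N_n(A): number of points in the parent cell
    return sse(ys) / n - (sse(left) + sse(right)) / n

# Hypothetical cell with four points along one coordinate.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.0, 1.0, 5.0, 5.0]

# Score each candidate cut and keep the one with the largest reduction.
best_score, best_cut = max((l_reg(xs, ys, z), z) for z in [1.5, 2.5, 3.5])
print(best_cut, best_score)
```

The cut at 2.5 separates the two groups of responses perfectly, so it removes all of the within-cell variance and wins.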
Stopping Condition:
Nodes are not split if they contain fewer than nodesize points or if all \(X_i\) in the node are identical.
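The stopping rule can be written as a small predicate. This is a sketch under the assumption that the points in a node are stored as tuples of feature values; `should_split` is a hypothetical helper name, and `nodesize` follows the parameter named in the text.

```python
def should_split(X_node, nodesize):
    """Return True only if the node may still be split."""
    if len(X_node) < nodesize:
        return False  # too few points in the node
    if all(x == X_node[0] for x in X_node):
        return False  # all feature vectors identical: no cut can separate them
    return True

print(should_split([(1, 2), (3, 4), (5, 6)], nodesize=2))
```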
Prediction:
\[ m_{M, n}(x) = \frac{1}{M} \sum_{j=1}^{M} m_n(x; \Theta_j, D_n) \]
Splitting Criteria (classification):
\[ \text{Gini}(A) = 2 p_{0, n}(A) p_{1, n}(A) \]
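where \(p_{0, n}(A)\) and \(p_{1, n}(A)\) are the empirical proportions of class 0 and class 1 observations in the cell \(A\).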
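For binary labels coded as 0 and 1, the Gini impurity of a cell is a one-line computation. The sketch below assumes the labels of the points in the cell are given as a non-empty list; `gini` is an illustrative name.

```python
def gini(labels):
    """Gini impurity 2 * p0 * p1 of a cell with 0/1 labels."""
    n = len(labels)
    p1 = sum(labels) / n  # proportion of class 1 in the cell
    p0 = 1.0 - p1         # proportion of class 0 in the cell
    return 2.0 * p0 * p1

print(gini([0, 0, 1, 1]))  # evenly mixed cell
print(gini([1, 1, 1]))     # pure cell
```

The impurity is largest for an evenly mixed cell and zero for a pure one, which is why cuts that produce purer children are preferred.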
Prediction:
\[ m_{M, n}(x; \Theta_1, \ldots, \Theta_M, D_n) = \begin{cases} 1 & \text{if } \frac{1}{M} \sum_{j=1}^{M} m_n(x; \Theta_j, D_n) > \frac{1}{2} \\ 0 & \text{otherwise} \end{cases} \]
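where each \(m_n(x; \Theta_j, D_n) \in \{0, 1\}\) is the class predicted by the \(j\)-th tree, so the forest predicts class 1 only when more than half of the trees vote for it.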
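The classification rule is a majority vote over the trees. A minimal sketch, assuming each tree's prediction is already available as 0 or 1 (`forest_classify` is a hypothetical name):

```python
def forest_classify(tree_votes):
    """Return 1 if strictly more than half of the trees vote 1, else 0."""
    return 1 if sum(tree_votes) / len(tree_votes) > 0.5 else 0

print(forest_classify([1, 1, 0]))  # two of three trees vote 1
print(forest_classify([1, 0]))     # a tie is not a strict majority
```

Because the inequality is strict, a tie goes to class 0.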
| | Total Orders | Closed Short | Fulfilled |
|---|---|---|---|
| | (n=7585) | (n=733) | (n=6852) |
| Top Customers | | | |
| Smoothie Island | 1701 (22.43%) | 455 (62.07%) | 1246 (18.18%) |
| Philly Bite | 1556 (20.51%) | 267 (36.43%) | 1289 (18.81%) |
| PlatePioneers | 1396 (18.40%) | 143 (19.51%) | 1253 (18.29%) |
| Berl Company | 906 (11.94%) | 5 (0.68%) | 901 (13.15%) |
| DineLink Intl | 589 (7.77%) | 42 (5.73%) | 547 (7.98%) |
| Top Products | | | |
| DC-01 | 1135 (14.96%) | 345 (47.07%) | 790 (11.53%) |
| TSC-PQB-01 | 1087 (14.33%) | 389 (53.07%) | 698 (10.19%) |
| TSC-PW14X16-01 | 848 (11.18%) | 283 (38.61%) | 565 (8.25%) |
| CMI-PCK-01 | 802 (10.57%) | 288 (39.29%) | 514 (7.50%) |
| PC-05-B1 | 745 (9.82%) | 220 (30.01%) | 525 (7.66%) |
| Top Distributors | | | |
| Ed Don & Company - Miramar | 210 (2.77%) | 0 (0.00%) | 210 (3.06%) |
| PFG- Gainesville | 197 (2.60%) | 0 (0.00%) | 197 (2.88%) |
| Ed Don & Company - Woodridge | 186 (2.45%) | 0 (0.00%) | 186 (2.71%) |
| Ed Don & Company - Mira Loma | 180 (2.37%) | 0 (0.00%) | 180 (2.63%) |
| .Ed Don - Miramar | 162 (2.14%) | 0 (0.00%) | 162 (2.36%) |
| Top Substrates | Paper | Plastic | Bagasse |
| Revenue ($103,826,286) | $54,838,585 (52.82%) | $40,336,669 (38.85%) | $4,350,337 (4.19%) |
| Quantity Ordered | Min | Mean | Max |
| Total Ordered (1,971,237) | 1 | 61.47 | 23,160 |
| Unit Price | Min | Mean | Max |
| Key Stats | $0.16 | $62.60 | $864.00 |
| Total Price | Min | Mean | Max |
| Key Stats | $4.92 | $3,430.74 | $143,084.74 |
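The percentages in the count rows appear to be each row's count divided by the column total \(n\) given in the header. The short check below reproduces the Smoothie Island row from the table's own numbers; the dictionary keys are illustrative names.

```python
# Column totals from the header row of the table.
totals = {"all": 7585, "closed_short": 733, "fulfilled": 6852}

# Counts for the Smoothie Island row.
smoothie_island = {"all": 1701, "closed_short": 455, "fulfilled": 1246}

# Share of each column, in percent, rounded to two decimals.
shares = {k: round(100 * smoothie_island[k] / totals[k], 2) for k in totals}
print(shares)
```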
Predicting Customer Churn
Random Forest Model Summary
Predicts SalesOrderStatus (Fulfilled vs. Unfulfilled) using 100 trees and mtry = 2.
Model Performance Metrics
Conclusions
UnitPrice and Product were the most significant predictors for classification.
Agreement between the predicted SalesOrderStatus (e.g., “Fulfilled” vs. “Closed Short”) and the actual status is very low beyond what could be expected by random guessing.
Random Forest Model Summary
Predicts QuantityFulfilled using 100 records of sales data.
Model Performance Metrics
Conclusions
The model predicts QuantityFulfilled with an average error of about 28 units (RMSE).
Numeric predictors (e.g., qtyOrdered, TotalPrice) are substantially more important than categorical ones.
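For readers unfamiliar with RMSE, it is the square root of the mean squared difference between actual and predicted values, expressed in the same units as the target. The sketch below uses made-up quantities purely for illustration; they are not taken from the sales data.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical fulfilled quantities and model predictions.
actual = [100, 40, 25, 60]
predicted = [101, 33, 30, 55]
print(rmse(actual, predicted))
```

Because the errors are squared before averaging, a few large misses weigh more heavily on the RMSE than many small ones.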